Methodology


Response to VAST Challenge questions

In this section, we will employ appropriate visually driven data analysis techniques to answer the questions in the challenge. We will also explore the various packages required to build the plots. The criteria for selection of plots are as follows:

- Level of customization
- Ease of use and implementation of customization
- Ease of understanding and interpretation of the plot, in both clarity and aesthetics
- Interactivity

Based on the interactive charts below, the top 3 most popular locations by transaction volume are listed below:

Ranking | Location | Date

1 | Katerina’s Cafe | 01/11/2014
2 | Hippokampos | 01//2014
3 | Guy’s Gyros | 01//2014

When comparing both datasets, we also noted differences between the transaction counts on the loyalty cards and the credit cards. In particular, there were days where loyalty card transactions outnumbered credit card transactions. This is unexpected, as a loyalty card is used to collect discounts and rewards and cannot be used for payment. Hence one would expect both volumes to be the same, or credit card volumes (actual purchases) to be higher than loyalty card volumes (in cases where the employee forgot to present the loyalty card for rewards or discounts). The daily difference in volumes between the two cards is illustrated below.
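The daily comparison described above can be sketched roughly as follows. This is a minimal illustration on toy data, not the code behind the chart: the data frames `cc` and `loyalty` and their column names are assumptions, and the credit card timestamps are truncated to dates so that both sources share the same granularity.

```r
library(dplyr)

# Toy stand-ins for the two card datasets (column names are assumptions).
cc <- data.frame(
  timestamp = as.POSIXct(c("2014-01-06 07:28:00", "2014-01-06 12:00:00",
                           "2014-01-07 08:15:00"), tz = "UTC"),
  location  = c("Katerina's Cafe", "Hippokampos", "Guy's Gyros")
)
loyalty <- data.frame(
  date     = as.Date(c("2014-01-06", "2014-01-07", "2014-01-07")),
  location = c("Katerina's Cafe", "Guy's Gyros", "Hippokampos")
)

# Count credit card transactions per calendar day.
cc_daily <- cc %>%
  mutate(date = as.Date(timestamp, tz = "UTC")) %>%
  count(date, name = "cc_n")

# Count loyalty card transactions per calendar day.
loyalty_daily <- loyalty %>%
  count(date, name = "loyalty_n")

# Join the two daily counts; a positive diff flags days where
# loyalty card volume exceeds credit card volume.
daily_diff <- full_join(cc_daily, loyalty_daily, by = "date") %>%
  mutate(across(c(cc_n, loyalty_n), ~ coalesce(.x, 0L)),
         diff = loyalty_n - cc_n) %>%
  arrange(date)

print(daily_diff)
```

Days with a positive `diff` are the unexpected ones discussed above.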

We will then analyze the transaction volume by day of week to observe volume trends across the week.

Ranking | Location | Most popular days

1 | Katerina’s Cafe | Tue, Thu, Sat
2 | Hippokampos | Mon, Wed, Thu
3 | Guy’s Gyros | Mon, Thu, Fri

Given that we are provided with credit card timestamp information, we will take one step further to analyze the volume of credit card transactions by location and time.

Ranking | Location | Most popular days | Most popular time

1 | Katerina’s Cafe | Mon, Tue, Sat | 1700-2000 across weekdays and weekends
2 | Hippokampos | Mon, Wed, Thu | 1300-1600 on weekdays and 1700-2000 on weekends
3 | Guy’s Gyros | Mon, Thu, Fri | 1700-2000 across weekdays and weekends, except for Friday where 1300-1600 is most popular
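One way to derive the day of week and time band from a credit card timestamp is sketched below. The timestamps are made up, and the four-hour bands (0100-0400, 0500-0800, and so on) are an assumption inferred from the labels in the table above rather than a confirmed definition. The weekday is taken from the numeric `wday` field so that the result does not depend on the session locale.

```r
# Hypothetical credit card timestamps.
ts <- as.POSIXct(c("2014-01-06 07:28:00", "2014-01-10 13:45:00",
                   "2014-01-11 19:10:00", "2014-01-13 03:00:00"), tz = "UTC")

# Hour of day as an integer (0-23).
hour <- as.integer(format(ts, "%H"))

# Locale-independent day of week: wday is 0 for Sunday through 6 for Saturday.
day_names   <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
day_of_week <- factor(day_names[as.POSIXlt(ts)$wday + 1], levels = day_names)

# Assumed four-hour bands; shifting by one hour makes 01:00 the start of the
# first band, so 00:00-00:59 falls into the last band (2100-0000).
band_labels <- c("0100-0400", "0500-0800", "0900-1200",
                 "1300-1600", "1700-2000", "2100-0000")
timegroup <- cut((hour - 1) %% 24, breaks = seq(0, 24, by = 4),
                 right = FALSE, labels = band_labels)

# Transaction volume by day of week and time band.
volume <- table(day_of_week, timegroup)
print(volume)
```

Adding `location` as a further grouping variable would give the per-location breakdown summarised in the table.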

As observed above, the most popular days of the week differ between the two cards for some locations, such as Katerina’s Cafe. This is unexpected, as we would expect the trends to be similar for both cards.

Also, based on the timestamps of the credit card transactions, we noted that all transactions at “Bean There Done That”, “Brewed Awakenings”, “Coffee Shack” and “Jack’s Magical Beans” are recorded at 12:00pm. It is highly unlikely that every transaction at these locations took place at the same time of day. Hence, the timestamps for these transactions may be incorrect.
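A check of this kind could look roughly like the sketch below, which flags locations whose transactions all share a single clock time. The data and the column names `timestamp` and `location` are illustrative assumptions.

```r
library(dplyr)

# Toy credit card data (column names are assumptions).
cc <- data.frame(
  timestamp = as.POSIXct(c("2014-01-06 12:00:00", "2014-01-07 12:00:00",
                           "2014-01-06 07:28:00", "2014-01-07 19:02:00"),
                         tz = "UTC"),
  location  = c("Coffee Shack", "Coffee Shack", "Hippokampos", "Hippokampos")
)

# For each location, count how many distinct clock times (HH:MM) occur.
# A location with several transactions but only one clock time is suspect.
suspect <- cc %>%
  mutate(clock = format(timestamp, "%H:%M")) %>%
  group_by(location) %>%
  summarise(n_txn         = n(),
            n_clock_times = n_distinct(clock),
            only_time     = first(clock),
            .groups       = "drop") %>%
  filter(n_txn > 1, n_clock_times == 1)

print(suspect)
```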

Given that these timestamps may not represent the actual transaction times, we will not use them in the subsequent analysis.

Furthermore, we also noted that there are several transactions in Kronos Mart at 3am on 13 January and 19 January. This is highly unusual and warrants further investigation.

Other anomalies noted from the data are as follows:

  1. As seen in the datatables below, spending information is provided for 55 credit cards but only 54 loyalty cards. This is unusual, as all employees are provided with a loyalty card. The discrepancy may arise because an employee does not want his or her movements to be tracked and hence avoids using the loyalty card, or because employees are using more than one credit card for their purchases.
  2. Timestamps in the credit card dataset are provided in datetime format, while timestamps in the loyalty card dataset are in date format. This made it harder to compare the two datasets. To overcome this, I grouped the credit card data by date so that it aligns with the loyalty card dataset.
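Both checks in the list above can be sketched on toy data as follows. The column names `last4ccnum` and `loyaltynum` are assumptions about how the card identifiers are stored, and the values are invented for illustration.

```r
library(dplyr)

# Toy data; identifiers and prices are made up.
cc <- data.frame(
  timestamp  = as.POSIXct(c("2014-01-06 07:28:00", "2014-01-06 13:05:00",
                            "2014-01-07 08:15:00"), tz = "UTC"),
  last4ccnum = c("4795", "7108", "4795"),
  price      = c(11.3, 52.2, 8.4)
)
loyalty <- data.frame(
  date       = as.Date(c("2014-01-06", "2014-01-07")),
  loyaltynum = c("L2070", "L2070"),
  price      = c(11.3, 8.4)
)

# Number of distinct cards in each dataset.
card_counts <- c(credit  = n_distinct(cc$last4ccnum),
                 loyalty = n_distinct(loyalty$loyaltynum))

# Truncate the datetime to a date, then aggregate per day so the credit
# card data has the same granularity as the loyalty card data.
cc_by_date <- cc %>%
  mutate(date = as.Date(timestamp, tz = "UTC")) %>%
  group_by(date) %>%
  summarise(n_txn = n(), total_spend = sum(price), .groups = "drop")

print(card_counts)
print(cc_by_date)
```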

Qn 2- Add the vehicle data to your analysis of the credit and loyalty card data. How does your assessment of the anomalies in question 1 change based on this new data? What discrepancies between vehicle, credit, and loyalty card data do you find?

Assuming that the car assignment list provided includes all employees, we noted that there are 44 distinct employees. However, we noted that there are 55 distinct credit card numbers and 54 distinct loyalty card numbers. This is unusual as each employee should have been issued a loyalty card and hence we would expect number of distinct credit cards, loyalty cards and employee count to match.

More investigation should be made into this discrepancy. One explanation could be that employees could have used more than one credit card with their loyalty card. Another explanation could be that there is a new employee who has not received the loyalty card. Given that the employee count is different from number of distinct loyalty cards, we should check with Gastech if there are any employees missing from this list.

From the car assignment dataset provided, we observe that there are nine truck drivers with no ID. This is consistent with what Gastech has explained, which is that employees who do not have company cars have the ability to check out company trucks for business use, but these trucks cannot be used for personal business.

The case scenario does not state which CarIDs refer to trucks. However, assuming that the 3-digit CarIDs represent trucks, we find GPS data for only five trucks. There is no evidence as to whether the truck IDs are sequential or whether each truck driver is assigned to a truck. Given that there are 9 truck drivers but GPS data for only 5 trucks, there is a possibility that:

  1. Truck drivers are not each assigned a unique truck, and trucks can be shared.
  2. There are 4 GPS paths missing from the GPS dataset.
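Under the same assumption that 3-digit CarIDs denote trucks, the count can be reproduced along these lines. The vector of IDs below is hypothetical and only stands in for the `id` column of the GPS data.

```r
# Hypothetical CarIDs as they might appear in the GPS data (stored as a factor).
gps_ids <- factor(c("1", "2", "35", "101", "104", "105", "106", "107",
                    "35", "101"))

# Assumption: any ID with exactly three digits is a truck.
id_chr    <- as.character(gps_ids)
truck_ids <- sort(unique(id_chr[nchar(id_chr) == 3]))

# Compare against the number of truck drivers in the car assignment list.
n_truck_drivers <- 9
n_unaccounted   <- n_truck_drivers - length(truck_ids)

print(truck_ids)
print(n_unaccounted)
```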

To perform further investigation on this, we will plot the GPS paths of each carID over the Abila map to identify their route.

Working with geospatial data

Georeferencing

  1. Download and launch QGIS, an open-source GIS application.

  2. Start a new project by clicking on Project > New.

Figure 3: Start new project
  3. Add a vector layer. Navigate to Layer > Add Layer > Add Vector Layer.
Figure 4: Add vector layer
  4. Click on the “…” button and navigate to the location of the Abila.shp file. Click on Add. You should see the map shown in Figure 6 in the main pane.
Figure 5: Select SHP file
Figure 6: Current map
  5. To make the map clearer, change the lines from green to black. Right-click on Abila in the Layers panel > select Properties > select Symbology > click on the dropdown arrow next to Color and select black > click Apply and OK.
Figure 7: Select Properties
Figure 8: Change line color
  6. To perform the georeferencing, click on Raster on the top pane > Georeferencer. When the new window appears, click on the blue square symbol on the top left to open the image file. Navigate to the MC2- tourist JPG file. Georeferencing requires selecting reference points (at least 6 control points) on the tourist map and matching them to the corresponding points on the SHP file map.

Figure 9: Perform georeferencing
Figure 10: Perform georeferencing

  7. To match the corresponding points, click on the Identify tool on the top pane. Hover over the suspected corresponding point and click on it. The Identify Results pane appears in the right corner. Check the result by matching the road name shown in the Identify Results pane to the tourist map.
Figure 11: Use the identify function to match corresponding points
Figure 12: Observe results to ensure correct point has been selected
  8. Once the point is confirmed to be correct, click on the selected point on the tourist map in the Georeferencer. In the Enter Map Coordinates window that appears, click on From Map Canvas and hover over the SHP file map. Ensure the crosshair is as close as possible to the actual point. Clicking on the point captures its X and Y coordinates. Click OK. The GCP table in the Georeferencer window will be updated with the first reference point. Repeat the above steps for the other reference points.
Figure 13: Select point on the tourist map
Figure 14: Update map coordinates
Figure 15: GCP table
  9. To check the settings, select Settings > Transformation Settings. Apply the settings shown in Figure 17. If the Target SRS is not set to WGS 84, click on the globe symbol next to the field and type “4326” in the filter pane. Select WGS 84 when it appears as the filtered result. Under output settings, ensure the results are saved in TIF format for later use. Tick the boxes next to “Save GCP points” and “Load in QGIS when done”. Click OK.
Figure 16: GCP table
Figure 17: Update settings
Figure 18: Update Target SRS
  10. Select File on the top pane > Start Georeferencing. Once georeferencing is completed, minimise the Georeferencer window and switch back to the map.
Figure 19: Start georeferencing
  11. In the Layers pane on the left, drag the image file below the Abila street map so that the street map is plotted on top of the tourist map. On inspection, the street map is well aligned with the image file. The TIF file created can then be used in RStudio as a raster object.
Figure 20: Rearrange layers
Figure 21: Georeferenced Map
Once georeferencing is complete, we import the raster layer into RStudio.
class      : RasterLayer 
band       : 1  (of  3  bands)
dimensions : 1595, 2706, 4316070  (nrow, ncol, ncell)
resolution : 3.16216e-05, 3.16216e-05  (x, y)
extent     : 24.82419, 24.90976, 36.04499, 36.09543  (xmin, xmax, ymin, ymax)
crs        : +proj=longlat +datum=WGS84 +no_defs 
source     : MC2-tourist.tif 
names      : MC2.tourist 
values     : 0, 255  (min, max)

Using st_read() of the sf package, we import the Abila shapefile into R. We then convert the aspatial GPS data into a simple feature data frame.
Reading layer `Abila' from data source 
  `D:\stellaloh91\Assignment\data\Geospatial' using driver `ESRI Shapefile'
Simple feature collection with 3290 features and 9 fields
Geometry type: LINESTRING
Dimension:     XY
Bounding box:  xmin: 24.82401 ymin: 36.04502 xmax: 24.90997 ymax: 36.09492
Geodetic CRS:  WGS 84
# A tibble: 685,169 x 10
   Timestamp           id      lat  long   day date       minute
   <dttm>              <fct> <dbl> <dbl> <int> <date>      <int>
 1 2014-01-06 06:28:01 35     36.1  24.9     6 2014-01-06     28
 2 2014-01-06 06:28:01 35     36.1  24.9     6 2014-01-06     28
 3 2014-01-06 06:28:03 35     36.1  24.9     6 2014-01-06     28
 4 2014-01-06 06:28:05 35     36.1  24.9     6 2014-01-06     28
 5 2014-01-06 06:28:06 35     36.1  24.9     6 2014-01-06     28
 6 2014-01-06 06:28:07 35     36.1  24.9     6 2014-01-06     28
 7 2014-01-06 06:28:09 35     36.1  24.9     6 2014-01-06     28
 8 2014-01-06 06:28:10 35     36.1  24.9     6 2014-01-06     28
 9 2014-01-06 06:28:11 35     36.1  24.9     6 2014-01-06     28
10 2014-01-06 06:28:12 35     36.1  24.9     6 2014-01-06     28
# ... with 685,159 more rows, and 3 more variables:
#   day_of_week <weekday>, hour <int>, timegroup <fct>
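The conversion from an aspatial table to a simple feature data frame can be sketched with `st_as_sf()`. The toy data frame below mirrors the `lat` and `long` columns shown in the output above, but the coordinate values are invented, and the choice of EPSG:4326 is an assumption based on the WGS 84 CRS reported for the shapefile.

```r
library(sf)

# Toy GPS records with the same column layout as the tibble above
# (coordinate values are made up for illustration).
gps <- data.frame(
  Timestamp = as.POSIXct(c("2014-01-06 06:28:01", "2014-01-06 06:28:03",
                           "2014-01-06 06:28:05"), tz = "UTC"),
  id   = factor(c("35", "35", "35")),
  lat  = c(36.07623, 36.07622, 36.07621),
  long = c(24.87469, 24.87460, 24.87444)
)

# coords takes x (longitude) first, then y (latitude).
gps_sf <- st_as_sf(gps, coords = c("long", "lat"), crs = 4326)

print(gps_sf)
```

Note that `coords` expects longitude before latitude; swapping them would place the points far from Abila.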
Next, we join the GPS points into movement paths, using the drivers’ IDs as unique identifiers.
On checking the data, we noticed line features that contain only a single coordinate pair. The following code chunk identifies and removes these orphan lines.
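Since the original chunks are not shown here, the following is only a sketch of how the two steps might be done with sf and dplyr on toy data. It differs slightly from the description above: instead of building every line and removing orphans afterwards, it counts the points per ID during the summary and drops single-point groups before casting to LINESTRING. All values are invented.

```r
library(sf)
library(dplyr)

# Toy GPS points: car 35 has three fixes, car 101 has only one (an orphan).
gps <- data.frame(
  id        = factor(c("35", "35", "35", "101")),
  Timestamp = as.POSIXct(c("2014-01-06 06:28:01", "2014-01-06 06:28:03",
                           "2014-01-06 06:28:05", "2014-01-06 09:00:00"),
                         tz = "UTC"),
  long      = c(24.87469, 24.87460, 24.87444, 24.88000),
  lat       = c(36.07623, 36.07622, 36.07621, 36.05000)
)
gps_sf <- st_as_sf(gps, coords = c("long", "lat"), crs = 4326)

gps_path <- gps_sf %>%
  arrange(id, Timestamp) %>%                       # keep points in time order
  group_by(id) %>%
  summarise(n_points = n(), do_union = FALSE) %>%  # combine points per ID
  filter(n_points > 1) %>%                         # drop single-point orphans
  st_cast("LINESTRING")                            # turn point sets into paths

print(gps_path)
```

Setting `do_union = FALSE` keeps the points in their original order rather than dissolving them, which matters when the sequence defines the route.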
Lastly, we overplot the selected GPS path onto the background tourist map.

By plotting the GPS coordinates using the Abila tourist map as background, we are able to visualize the path each vehicle is using. The map is also interactive. Clicking on any point in the trajectory allows us to see the CarID, day and timestamp of the respective route. This allows us to match the timestamp and location back to the credit card dataset, hence matching the credit card numbers to their corresponding CarID.

We have used a facet map below to visualize the daily route for CarID 1 across each of the 14 days of GPS data on record. This allows for easy observation and matching to the credit card data.

After mapping the GPS trajectories, we also noted that no GPS data indicate that any car stopped near “Bean There Done That”, “Brewed Awakenings”, “Coffee Shack” or “Jack’s Magical Beans” at 12:00pm, the recorded time of the transactions there. Hence, these transactions are either incorrectly timed or may even be fraudulent.

Furthermore, as mentioned above, there are several transactions at Kronos Mart at 3am on 13 January and 19 January. Filtering the map for 3am shows no cars near Kronos Mart. Hence, these transactions are either incorrectly timed or may even be fraudulent.